Abstract

In a recent paper [1], Klaere et al. modeled the impact of substitutions on arbitrary branches of a phylogenetic
tree on an alignment site by the so-called One Step Mutation (OSM) matrix. By utilizing the concept of the
OSM matrix for the four-state nucleotide alphabet, Nguyen et al. [2] presented an efficient procedure to compute
the minimal number of substitutions needed to translate one alignment site into another. The present paper
delivers a proof for this computation. Moreover, we provide several mathematical insights into the generalization
of the OSM matrix to multistate alphabets. The construction of the OSM matrix is only possible if the matrices
representing the substitution types acting on the character states and the identity matrix form a commutative
group with respect to matrix multiplication. We illustrate a means to establish such a group for the twenty-state
amino acid alphabet and critically discuss its biological usefulness.



1 Background 

Alignments of homologous sequences provide fundamental materials to the reconstruction of phylogenetic
trees and many other sequence-based analyses (see, e.g., [3, 4]). Each alignment column (site) consists of
character states that are assumed to have evolved from a common ancestral state by means of mutations.
Any combination of the character states in the aligned sequences at one alignment column represents a
so-called character [5], which is sometimes also called site pattern [1]. Given a phylogenetic tree and an
alignment that evolved along the tree, Klaere et al. [1] showed, for binary alphabets, how a character changes
into another character if a substitution occurs on an arbitrary branch of the tree. The impact of such a
substitution is summarized by the so-called One Step Mutation (OSM) matrix. Nguyen et al. [2] extended the
concept of the OSM matrix to the four-state nucleotide alphabet while developing a method to evaluate the
goodness of fit between models and data in phylogenetic inference. There, the OSM matrix is constructed
based on the Kimura three parameter (K3ST) substitution model [6]. Nguyen et al. [2] illustrated how one
can use Maximum Parsimony (i.e. apply the Fitch algorithm [7]) to compute the minimal number of
substitutions required to change one character into another character under the OSM setting. In the present
paper, we deliver a proof for this transformation.

In addition, the OSM matrix can be constructed only if the substitution matrices and the identity
matrix form a commutative or Abelian group (see, e.g., [8, 9]) with respect to matrix multiplication [2]. We
generalize the construction of the OSM matrix for any alphabet. Moreover, we show that the number of
substitutions needed to convert one character into another may change if we use different groups. Finally,
we provide a means to find an Abelian group for the twenty-state amino acid alphabet.

2 Notation and Problem Recapitulation 

2.1 Notation 

Recall that a rooted binary phylogenetic $X$-tree is a tree $T = (V(T), E(T))$ with leaf set (also called taxon set)
$X = \{1, \ldots, n\} \subset V(T)$ with only vertices of degree 1 or 3 (internal vertices), where one of the vertices of
degree 1 is defined to be the root, and all edges are directed away from it. In this paper, when there is no
ambiguity we often just write "phylogenetic tree" or "tree" when referring to a rooted binary phylogenetic
tree. Also, when referring to a tree on a leaf set $X$ with $|X| = n$, we write $n$-taxon tree for short.

Furthermore, recall that a character $f$ is a function $f : X \to \mathcal{C}$ for some set
$\mathcal{C} := \{c_1, c_2, c_3, \ldots, c_r\}$ of $r$ character states ($r \in \mathbb{N}$). We denote by
$\mathcal{C}^n$ the set of all $r^n$ possible characters on $\mathcal{C}$ and $n$ taxa. For instance, for
the four-state DNA alphabet, $\mathcal{C}_{DNA} = \{A, G, C, T\}$ and the set $\mathcal{C}^n_{DNA}$ consists of
$4^n$ elements. An extension of $f$ to $V(T)$ is a map $g : V(T) \to \mathcal{C}$ such that $g(i) = f(i)$ for all
$i$ in $X$. For such an extension $g$ of $f$, we denote by $l_T(g)$ the number of edges $e = \{u, v\}$ in $T$ on
which a substitution occurs, i.e. where $g(u) \neq g(v)$. The parsimony score of $f$ on $T$, denoted by $l_T(f)$,
is obtained by minimizing $l_T(g)$ over all possible extensions $g$. Given a tree $T$ and a character $f$ on the
same taxon set, one can easily calculate the parsimony score of $f$ on $T$ with the famous Fitch algorithm [7].
Moreover, when a character state changes along one edge of the tree, we refer to this state change as
substitution or mutation. As for our purposes only so-called manifest mutations are relevant, i.e. those
mutations that can be observed and are not reversed, we do not distinguish between mutations and
substitutions, which is why we use these terms synonymously.
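The parsimony score defined above can be computed bottom-up. The following is a minimal sketch of the Fitch recursion, assuming a tree encoded as nested pairs of integer taxon labels; the encoding, the function name `fitch` and the five-taxon topology are illustrative assumptions and are not taken from the paper. The root state is left unconstrained here.

```python
def fitch(tree, f):
    """Return (candidate states, substitutions) for a nested-tuple tree.

    A leaf is an int taxon label; an internal vertex is a pair (left, right).
    """
    if isinstance(tree, int):
        return {f[tree]}, 0
    left_states, left_cost = fitch(tree[0], f)
    right_states, right_cost = fitch(tree[1], f)
    shared = left_states & right_states
    if shared:                                   # children agree: no extra substitution
        return shared, left_cost + right_cost
    return left_states | right_states, left_cost + right_cost + 1


tree = ((1, 2), (3, (4, 5)))                     # hypothetical five-taxon topology
f = {1: "G", 2: "G", 3: "C", 4: "C", 5: "C"}
score = fitch(tree, f)[1]                        # parsimony score of f on the tree
```

On this assumed topology the character GGCCC needs a single substitution when the root state is free.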



2.2 Construction of the OSM matrix 

We now introduce the OSM framework in a stepwise fashion. The aim of the OSM approach is to determine 
the effects a single mutation occurring on a rooted tree $T$ has on a character evolving on that tree.

The first task of this approach is to formalize the term mutation and its effects on a single character state
in $\mathcal{C}$. A mutation is an operation $\sigma : \mathcal{C} \to \mathcal{C}$ which is bijective, i.e. it
satisfies the following condition:

1. For all $c_i \in \mathcal{C}$ there is a $c_j \in \mathcal{C}$ such that $\sigma(c_i) = c_j$, and if
$\sigma(c_i) = \sigma(c_j)$, then $c_i = c_j$.

This guarantees that a mutation affects a character state in a unique fashion. It is well-known that any 
bijective operation on a finite discrete state set is isomorphic to a permutation (e.g., [10]). Therefore, in the
following we consider mutations to be permutations. 

The next step is to establish which permutations we consider admissible in a model. In other words, we
next establish conditions on the set $\Sigma$ of permutations acting on $\mathcal{C}$.

2. For every pair $c_i, c_j \in \mathcal{C}$ there is exactly one operation $\sigma \in \Sigma$ such that
$\sigma(c_i) = c_j$.

This guarantees that every character state change can be observed within a single step and that we do not
have any ambiguity. If $\Sigma$ contains the identity, i.e. the mapping $\sigma_0$ such that $\sigma_0(c_i) = c_i$
for all $c_i \in \mathcal{C}$, then all other permutations in $\Sigma$ are fix-point free due to Condition 2.
Condition 2 also implies that $\Sigma$ contains exactly $r$ permutations, where $r$ is the number of character
states in $\mathcal{C}$. If $\Sigma$ had more permutations then for all states $c_i \in \mathcal{C}$ there would be
a pair of distinct permutations $\sigma_1, \sigma_2 \in \Sigma$ such that $\sigma_1(c_i) = \sigma_2(c_i)$, which
would lead to ambiguity. Condition 2 also implies that we exclude GTR pTj from the set of admissible
models. However, we explain this more in-depth in Section 3.3.



We add some more useful conditions which give $\Sigma$ a very convenient structure:

3. For all $\sigma_1, \sigma_2 \in \Sigma$ also the product $\sigma_1 \circ \sigma_2 \in \Sigma$. In other words,
$\Sigma$ is closed with respect to concatenation of its permutations.

4. For all $\sigma_1, \sigma_2 \in \Sigma$ we have $\sigma_1 \circ \sigma_2 = \sigma_2 \circ \sigma_1$. Thus, $\Sigma$ is
commutative, and hence the order in which we assign permutations is irrelevant for the outcome.

5. There is an element $\sigma_0 \in \Sigma$ such that for all $\sigma_i \in \Sigma$, we have
$\sigma_i \circ \sigma_0 = \sigma_0 \circ \sigma_i = \sigma_i$. As pointed out above, including the identity
guarantees that all other permutations will force a state change, a feature which led to the name
"One Step Mutation".

6. For every $\sigma_1 \in \Sigma$, there is a $\sigma_2 \in \Sigma$, such that $\sigma_1 \circ \sigma_2 = \sigma_0$.
The existence of such an inverse element guarantees that every operation can be reversed within a single
step, which is quite a useful property.

7. For all $\sigma_1, \sigma_2, \sigma_3 \in \Sigma$ we have
$\sigma_1 \circ (\sigma_2 \circ \sigma_3) = (\sigma_1 \circ \sigma_2) \circ \sigma_3 = \sigma_1 \circ \sigma_2 \circ \sigma_3$.
Associativity is needed to enforce a group structure on $\Sigma$.

All of these conditions taken together imply that $\Sigma$ forms an Abelian group of $r$ permutations. From now on
we use the matrix form of permutations for illustration of the operations. A permutation matrix $\sigma$ over
$\mathcal{C}$ is represented by an $r \times r$ matrix such that $\sigma_{c_i c_j} = 1$ if $\sigma(c_i) = c_j$, and
$0$ otherwise. In that case, a concatenation "$\circ$" is equivalent to the matrix multiplication "$\cdot$".

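Example 2.1. In genetics, the most commonly used character state set is $\mathcal{C}_{DNA} = \{A, G, C, T\}$. There
are two different Abelian groups for four states, namely the Klein-Four-group $\mathbb{Z}_2 \times \mathbb{Z}_2$ and
the cyclic group $\mathbb{Z}_4$. The Klein-Four-group is constructed from the cyclic group over two elements,
identity ($\tau_0$) and flip ($\tau_1$). These take the matrix form

$$\tau_0 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad \tau_1 = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.$$

The Klein-Four-group consists of the four Kronecker products of these two matrices, i.e.
$s_0 = \tau_0 \otimes \tau_0$, $s_1 = \tau_1 \otimes \tau_0$, $s_2 = \tau_0 \otimes \tau_1$, and $s_3 = \tau_1 \otimes \tau_1$.
In full, they take the form (rows and columns in the order A, C, G, T):

$$s_0 = \begin{pmatrix} 1&0&0&0 \\ 0&1&0&0 \\ 0&0&1&0 \\ 0&0&0&1 \end{pmatrix}, \quad
s_1 = \begin{pmatrix} 0&0&1&0 \\ 0&0&0&1 \\ 1&0&0&0 \\ 0&1&0&0 \end{pmatrix}, \quad
s_2 = \begin{pmatrix} 0&1&0&0 \\ 1&0&0&0 \\ 0&0&0&1 \\ 0&0&1&0 \end{pmatrix}, \quad
s_3 = \begin{pmatrix} 0&0&0&1 \\ 0&0&1&0 \\ 0&1&0&0 \\ 1&0&0&0 \end{pmatrix}.$$

This construction coincides with the K3ST model of substitution, where $s_1$ describes transitions within purines (A, G)
and pyrimidines (C, T), $s_2$ represents transversions within pairs (A, C) and (G, T), and $s_3$ represents the remaining
set of transversions within pairs (A, T) and (C, G).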
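Conditions 2 to 6 can be verified mechanically for a candidate set of permutations (Condition 7 holds automatically for composition of maps). The sketch below is an illustration only: permutations are stored as tuples of image indices, and the helper names `swaps`, `compose` and `satisfies_conditions` are hypothetical. It checks the Klein-Four-group as described by its transition and transversion pairs.

```python
from itertools import product

STATES = "AGCT"

def swaps(*pairs):
    """Permutation on STATES (tuple of image indices) exchanging the given pairs."""
    image = {c: c for c in STATES}
    for x, y in pairs:
        image[x], image[y] = y, x
    return tuple(STATES.index(image[c]) for c in STATES)

def compose(p, q):
    """Apply q first, then p."""
    return tuple(p[q[i]] for i in range(len(q)))

def satisfies_conditions(sigma):
    r = len(sigma[0])
    members, identity = set(sigma), tuple(range(r))
    pairs = list(product(sigma, repeat=2))
    return (
        all(sorted(p[i] for p in sigma) == list(range(r)) for i in range(r))   # Condition 2
        and all(compose(p, q) in members for p, q in pairs)                    # Condition 3
        and all(compose(p, q) == compose(q, p) for p, q in pairs)              # Condition 4
        and identity in members                                                # Condition 5
        and all(any(compose(p, q) == identity for q in sigma) for p in sigma)  # Condition 6
    )

# s0 (identity), s1 (transitions), s2 and s3 (the two transversion classes)
klein = [swaps(), swaps("AG", "CT"), swaps("AC", "GT"), swaps("AT", "CG")]
```

A set such as the identity plus three transpositions that all move A fails Condition 2, because several of its elements leave G unchanged.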
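The cyclic group is given by the permutation set

$$s'_1 = \begin{array}{c|cccc} & A & C & G & T \\ \hline A & 0 & 0 & 1 & 0 \\ C & 1 & 0 & 0 & 0 \\ G & 0 & 0 & 0 & 1 \\ T & 0 & 1 & 0 & 0 \end{array}, \qquad
s'_2 = s'_1 \cdot s'_1, \quad s'_3 = s'_1 \cdot s'_2, \quad s'_0 = s'_1 \cdot s'_3.$$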

Note that the cyclic group $\mathbb{Z}_4$ has a different interpretation with a different ordering of the nucleotides.
E.g., our matrix $s'_1$ yields the rotation A → G → T → C → A, while Bryant [11] uses the rotation
A → C → G → T → A. The cyclic group associated to the latter rotation [12] is linked to the K2ST substitution
model [13], where $s'_2$ corresponds to the transition within purines and pyrimidines, and $s'_1$ and $s'_3$ are the
(not further distinguished) transversions.
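To make the dependence on the nucleotide ordering concrete, the following sketch generates the cyclic group from a rotation and compares the two rotations mentioned above. Permutations are plain dictionaries and the function names are illustrative assumptions.

```python
def rotation(cycle):
    """Permutation sending each state to its successor in `cycle`."""
    return {cycle[i]: cycle[(i + 1) % len(cycle)] for i in range(len(cycle))}

def compose(p, q):
    """Apply q first, then p."""
    return {c: p[q[c]] for c in q}

def cyclic_group(cycle):
    """Return [identity, s1, s1*s1, ...] for the rotation along `cycle`."""
    s1 = rotation(cycle)
    group = [{c: c for c in cycle}]
    for _ in range(len(cycle) - 1):
        group.append(compose(s1, group[-1]))
    return group

ours = cyclic_group("AGTC")     # rotation A -> G -> T -> C -> A
bryant = cyclic_group("ACGT")   # rotation A -> C -> G -> T -> A
```

Under the second rotation the squared generator exchanges A with G and C with T, i.e. it is the transition; under the first rotation it exchanges A with T and G with C instead. In both cases the generator is not self-inverse: it is undone by its third power.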

The next step in constructing the OSM matrix is to construct a set $\Sigma_T$ of operations over $\mathcal{C}^n$
governed by $T$, and based on the permutation set $\Sigma$. To this end, we first define $\Sigma^n$ as a set of
operations which work elementwise, i.e. for $f = (c_1, \ldots, c_n) \in \mathcal{C}^n$ and $\sigma \in \Sigma^n$ we have

$$\sigma(f) := (\sigma_1(c_1), \ldots, \sigma_n(c_n)), \quad \sigma_i \in \Sigma.$$

This can also be described by the Kronecker product, i.e. equally

$$\sigma(f) = \sigma_1 \otimes \cdots \otimes \sigma_n (f).$$

This means that there are $r^n$ different operators in $\Sigma^n = \Sigma \otimes \cdots \otimes \Sigma$. Thus, for any
pair of characters $f, g \in \mathcal{C}^n$ we can find an operation $\sigma \in \Sigma^n$ such that $\sigma(f) = g$.

Another noteworthy consequence of using the Kronecker product is that the elements of $\Sigma^n$ are
permutations over $\mathcal{C}^n$ [14, 15], and in fact $\Sigma^n$ satisfies our Conditions 1-7, i.e. $\Sigma^n$ is an
Abelian group over $\mathcal{C}^n$.
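The equivalence of the elementwise definition and the Kronecker product can be checked directly for two taxa. This is a small sketch with hand-written helpers (`perm_matrix`, `kron`, `apply_matrix` and `apply_elementwise` are illustrative names); characters are listed in lexicographic order so that they match the row order of the Kronecker product.

```python
from itertools import product

STATES = "AGCT"
s1 = {"A": "G", "G": "A", "C": "T", "T": "C"}   # transition
s2 = {"A": "C", "C": "A", "G": "T", "T": "G"}   # transversion within (A, C), (G, T)

def perm_matrix(image):
    """Entry (c_i, c_j) is 1 iff the permutation maps c_i to c_j."""
    return [[1 if image[ci] == cj else 0 for cj in STATES] for ci in STATES]

def kron(a, b):
    """Kronecker product of two square matrices given as nested lists."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]

chars = ["".join(p) for p in product(STATES, repeat=2)]   # the 16 two-taxon characters
big = kron(perm_matrix(s1), perm_matrix(s2))              # 16 x 16 matrix for (s1, s2)

def apply_matrix(m, f):
    """Read off the image of character f from its row in m."""
    return chars[m[chars.index(f)].index(1)]

def apply_elementwise(sigmas, f):
    return "".join(s[c] for s, c in zip(sigmas, f))
```

Both routes give the same image for every character, e.g. GA is mapped to AC.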

In the OSM framework we assume that the permutations acting on a character $f \in \mathcal{C}^n$ are derived from
the underlying rooted tree $T$. In particular, regard the pendant edge to a Taxon $1 \in X$. A permutation on
this edge will only affect the character state of $f_1$. Therefore, the edge implies permutations of type

$$\sigma^{e_1,i} := \sigma_i \otimes \sigma_0 \otimes \cdots \otimes \sigma_0, \quad i = 1, \ldots, r-1.$$



This construction works analogously for all pendant edges, with all but one factor being the identity $\sigma_0$,
while the one non-identity factor is one of the remaining $r-1$ permutations in $\Sigma$. A permutation at an
interior edge affects the character states of all its descendants, i.e. those taxa whose path to the root passes
that edge. E.g., assume Taxa 1 and 2 form a cherry, i.e. their most recent common ancestor has no other
descendants, and permutation $\sigma_i \in \Sigma$, $i = 1, \ldots, r-1$, is acting on the edge leading to this
ancestor. Then, we get the permutation

$$\sigma^{e_{12},i} := \sigma_i \otimes \sigma_i \otimes \sigma_0 \otimes \cdots \otimes \sigma_0 = \sigma^{e_1,i} \cdot \sigma^{e_2,i}.$$

The right hand side equation shows that a single permutation on an internal edge has the same effect as
simultaneously applying the same permutation on the pendant edges of all descendant taxa. This also
shows that the set $\Sigma_X$ of all permutations on the pendant edges is a generator of $\Sigma^n$, i.e. the closure
of $\Sigma_X$ contains all permutations in $\Sigma^n$. Since $\Sigma^n$ contains a single permutation to transform
character $f \in \mathcal{C}^n$ into $g \in \mathcal{C}^n$, and since $\Sigma_X$ generates $\Sigma^n$, there is a shortest
chain of permutations in $\Sigma_X$ which transforms $f$ into $g$. $\Sigma_X$ is also the set of permutations implied
by the star tree for $X$. For every $X$-tree $T$ we have $\Sigma_T \supseteq \Sigma_X$, and therefore $\Sigma_T$ is a
generator for $\Sigma^n$, too. An illustration of such a generator set $\Sigma_T$ over the character set
$\mathcal{C}^n$ is the so-called Cayley graph [16], which has as vertices the elements of $\mathcal{C}^n$, and two
elements $f, g \in \mathcal{C}^n$ are connected if there is a permutation $\sigma \in \Sigma_T$ such that
$\sigma(f) = g$. In [1] Cayley graphs have been presented as alternative illustrations of the tree $T$ over a
binary state set $\mathcal{C}$.
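The generator property can be illustrated for the Klein-Four-group. In the sketch below an element of the $n$-fold product is a tuple of type indices 0 to 3, and composition is taken componentwise as bitwise XOR; this encoding is an assumption made for brevity (it is valid for $\mathbb{Z}_2 \times \mathbb{Z}_2$ because $s_1 \circ s_2 = s_3$ and every element is self-inverse), and the function names are hypothetical.

```python
def compose(a, b):
    """Componentwise composition in the Klein-Four-group: s_i o s_j = s_(i XOR j)."""
    return tuple(x ^ y for x, y in zip(a, b))

def closure(generators, n):
    """All operations reachable by concatenating generators, starting at the identity."""
    identity = (0,) * n
    seen, frontier = {identity}, [identity]
    while frontier:
        new = []
        for g in frontier:
            for h in generators:
                gh = compose(g, h)
                if gh not in seen:
                    seen.add(gh)
                    new.append(gh)
        frontier = new
    return seen

n = 3
# One non-identity type on one pendant edge, identity on all other taxa.
pendant = [tuple(i if t == taxon else 0 for t in range(n))
           for taxon in range(n) for i in (1, 2, 3)]
cherry = compose((1, 0, 0), (0, 1, 0))   # s_1 on the edge above the cherry of Taxa 1 and 2
```

The closure of the nine pendant-edge operations has $4^3 = 64$ elements, i.e. it is the whole product group, and the cherry operation is the product of the two pendant operations of the same type.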



Example 2.2. Regard the K3ST model from Example 2.1 and the rooted two-taxon tree depicted in Figure 1(a). With this,
$\Sigma_T^{K3ST}$ is given by the set

$$\begin{array}{lll}
s^{e_1,1} := s_1 \otimes s_0, & s^{e_2,1} := s_0 \otimes s_1, & s^{e_{12},1} := s_1 \otimes s_1, \\
s^{e_1,2} := s_2 \otimes s_0, & s^{e_2,2} := s_0 \otimes s_2, & s^{e_{12},2} := s_2 \otimes s_2, \\
s^{e_1,3} := s_3 \otimes s_0, & s^{e_2,3} := s_0 \otimes s_3, & s^{e_{12},3} := s_3 \otimes s_3.
\end{array}$$

Each operation is thus a symmetric $16 \times 16$ permutation matrix depicting a transition ($s^{e,1}$), transversion 1
($s^{e,2}$), or transversion 2 ($s^{e,3}$) along edge $e \in E(T)$. Figure 1(b), (c), and (d) display the permutation
matrices for a transition on branch $e_1$ ($s^{e_1,1}$), $e_2$ ($s^{e_2,1}$) and $e_{12}$ ($s^{e_{12},1}$), respectively.
Figure 2(a) shows the Cayley graph associated with $\Sigma_T^{K3ST}$.

We are now in a position to recall the definition of the OSM matrix $M_T$ for a rooted binary phylogenetic
tree $T$ as explained in [1] and [17]. For an edge $e \in E(T)$ we denote by $p_e$ the relative branch length of $e$,
i.e. its actual length divided by the length of $T$. Thus, one can view $p_e$ as the probability that a mutation is
observed at edge $e$ given that a mutation occurred on $T$. Clearly, $\sum_{e \in E(T)} p_e = 1$. Further, denote by
$\alpha_{e,i}$ the probability that this mutation on $e$ is of type $i \in \{1, \ldots, r-1\}$ with
$\sum_{i=1}^{r-1} \alpha_{e,i} = 1$ for all $e \in E(T)$. Then the OSM matrix is the convex sum of the elements in
$\Sigma_T$, where each permutation $\sigma^{e,i}$ is multiplied by $p_e \alpha_{e,i}$, the probability of hitting the
edge $e$ with permutation $\sigma_i \in \Sigma$. Thus, we obtain:

$$M_T = \sum_{e \in E(T)} p_e \sum_{i=1}^{r-1} \alpha_{e,i} \, \sigma^{e,i}.$$

$M_T$ can be regarded as the weighted exchangeability matrix for all characters given that a substitution
occurs somewhere on the tree $T$. Figure 1(e) depicts the OSM matrix for the tree in Figure 1(a). Here,
colors indicate relative branch lengths $p_e$, and patterns denote permutation types $\sigma_i$. E.g., a blue square
with horizontal lines indicates the product $p_{e_2} \alpha_{e_2,1}$, i.e. the probability of observing a Transition
$s_1$ on Edge $e_2$.
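As an illustration of this convex sum, the sketch below assembles the OSM matrix for the two-taxon tree of Example 2.2 with exact fractions. The branch proportions and type probabilities are made-up values, and nucleotides are encoded as A=0, G=1, C=2, T=3 so that the K3ST type $s_i$ acts as XOR with $i$; both choices are assumptions of this example, not values from the paper.

```python
from fractions import Fraction
from itertools import product

chars = list(product(range(4), repeat=2))                 # A=0, G=1, C=2, T=3
edges = {"e1": (1, 0), "e2": (0, 1), "e12": (1, 1)}       # 1 marks taxa below the edge
p = {"e1": Fraction(1, 2), "e2": Fraction(1, 4), "e12": Fraction(1, 4)}   # hypothetical
alpha = {e: {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 4)}     # hypothetical
         for e in edges}

def osm_matrix():
    """Sum over edges e and types i of p_e * alpha_(e,i) times the permutation matrix."""
    index = {f: k for k, f in enumerate(chars)}
    m = [[Fraction(0)] * len(chars) for _ in chars]
    for e, below in edges.items():
        for i in (1, 2, 3):
            for f in chars:
                # apply type i to every taxon below edge e, identity elsewhere
                g = tuple(c ^ (i * b) for c, b in zip(f, below))
                m[index[f]][index[g]] += p[e] * alpha[e][i]
    return m

M = osm_matrix()
```

With these weights the entry for AA to GG equals $p_{e_{12}} \alpha_{e_{12},1} = 1/8$, and every row and column of `M` sums to 1.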

2.3 The transformation problem 

With the construction of $\Sigma_T$ we have generated the tools needed to formally describe the computations in
Step 4 of the MISFITS method introduced by Nguyen et al. [2]. Given a rooted tree $T$ and two characters $f$
and $f^d$ in $\mathcal{C}^n$, we want to compute the minimal number of substitutions required on the tree to convert
$f$ into $f^d$.

In our framework this corresponds to finding the smallest number $k$ of permutations
$\sigma_1, \ldots, \sigma_k \in \Sigma_T$ such that $\sigma_1 \circ \cdots \circ \sigma_k (f) = f^d$. The number $k$ has
multiple equivalent interpretations. It is also the length of the shortest path between $f$ and $f^d$ in the Cayley
graph for $\Sigma_T$, where this path corresponds exactly to the chain $\sigma_1 \circ \cdots \circ \sigma_k$, since each
edge in the Cayley graph corresponds to an operation in $\Sigma_T$. $k$ is also the smallest matrix power such that
$(M_T^j)_{f, f^d} = 0$ for $j < k$ and $(M_T^k)_{f, f^d} > 0$, because a positive entry in $M_T^k$ means that
there is a concatenation of $k$ permutations connecting the associated characters.
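The shortest-path interpretation can be sketched as a breadth-first search on the Cayley graph. The example uses the two-taxon tree with the nine K3ST operations of Example 2.2; the XOR encoding (A=0, G=1, C=2, T=3) and the name `min_steps` are assumptions of this illustration.

```python
from collections import deque

STATES = "AGCT"   # encoded 0..3 so that K3ST type s_i acts as XOR with i
GENERATORS = ([(i, 0) for i in (1, 2, 3)]      # pendant edge e1
              + [(0, i) for i in (1, 2, 3)]    # pendant edge e2
              + [(i, i) for i in (1, 2, 3)])   # edge e12 above both taxa

def encode(f):
    return tuple(STATES.index(c) for c in f)

def min_steps(f, fd):
    """Length of a shortest path from f to fd in the Cayley graph."""
    start, goal = encode(f), encode(fd)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        if cur == goal:
            return dist[cur]
        for g in GENERATORS:
            nxt = tuple(c ^ x for c, x in zip(cur, g))
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    return None
```

For instance, AA is one step away from GG (a transition on $e_{12}$) but two steps away from GC. The search visits up to $r^n$ characters, which is why a procedure that avoids enumerating the graph is of interest.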

Nguyen et al. [2] presented an efficient procedure to compute the minimal number of substitutions as
summarized in Algorithm 1, and we prove its correctness in Section 3.1.



Algorithm 1. INPUT: rooted binary phylogenetic tree $T$ on leaf set $X$, characters $f$ and $f^d$ on $X$, group $\Sigma$.

ITERATION 1: Align characters $f$ and $f^d$ and find the corresponding substitution type $\sigma_i$ which translates
$f_j$ into $f^d_j$ for all positions $j = 1, \ldots, |X|$. Let $\sigma \in \Sigma^n$ be the resulting operation.

ITERATION 2: Let $h := c_1 \ldots c_1$ be the constant character on $X$ with some $c_1 \in \mathcal{C}$ on all positions.
Apply $\sigma$ to $h$ and call the derived character $c$.

ITERATION 3: Calculate $m := l_T(c)$.

OUTPUT: Minimum number $m$ of substitutions needed to evolve $f^d$ instead of $f$ on $T$.
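The three iterations can be sketched as follows. This is an illustration under stated assumptions rather than the implementation of [2]: the tree is a nested tuple with a hypothetical five-taxon topology, and the parsimony step fixes the root state to the constant state (the constraint used in the worked example for this algorithm), implemented here with a small Sankoff-style recursion instead of the Fitch sets. The names `sankoff` and `algorithm1` are illustrative.

```python
K3ST = [
    {"A": "A", "G": "G", "C": "C", "T": "T"},   # s0, identity
    {"A": "G", "G": "A", "C": "T", "T": "C"},   # s1, transitions
    {"A": "C", "C": "A", "G": "T", "T": "G"},   # s2, transversions (A, C), (G, T)
    {"A": "T", "T": "A", "C": "G", "G": "C"},   # s3, transversions (A, T), (C, G)
]

def sankoff(tree, c, states):
    """Minimal substitutions below `tree` for each state at its top vertex."""
    if isinstance(tree, int):                    # leaf: taxon label 1..n
        return {s: (0 if s == c[tree - 1] else float("inf")) for s in states}
    cost = {s: 0 for s in states}
    for child in tree:
        sub = sankoff(child, c, states)
        for s in states:
            cost[s] += min(sub[t] + (s != t) for t in states)
    return cost

def algorithm1(tree, f, fd, group, c1="A"):
    # Iteration 1: the unique group element mapping f_j to fd_j at each position
    sigma = [next(s for s in group if s[a] == b) for a, b in zip(f, fd)]
    # Iteration 2: apply sigma to the constant character c1...c1
    c = "".join(s[c1] for s in sigma)
    # Iteration 3: parsimony of c with the root state fixed to c1
    below = sankoff(tree, c, list(group[0]))
    return c, min(below[t] + (t != c1) for t in below)

tree = ((1, 2), (3, (4, 5)))                     # hypothetical topology
result = algorithm1(tree, "GTAGA", "ACCTC", K3ST)
```

On this assumed topology, converting GTAGA into ACCTC yields the derived character GGCCC and two substitutions.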

Example 2.3. Figure 3 demonstrates how Algorithm 1 works under the K3ST model, i.e. when the group is
$\Sigma = \Sigma_{K3ST}$ (Figure 3(a)). Consider the rooted five-taxon tree in Figure 3(b) and the character GTAGA at
the leaves. Assume that the character GTAGA is to be converted into character ACCTC. By comparing the two characters
positionwise, we need a substitution $s_1$ on the external branch leading to Taxon 1 to convert G into A at the first
position. Similarly, we need a substitution $s_1$ on the external branch leading to Taxon 2, and a substitution $s_2$ on
every external branch leading to Taxa 3, 4, and 5. Thus, the operation $s := (s_1, s_1, s_2, s_2, s_2)$ transfers the
character GTAGA into the character ACCTC. As the operation $s$ also translates the constant character
AAAAA into GGCCC, converting GTAGA into ACCTC is equivalent to evolving the character state A at the root
along the tree to obtain the character GGCCC at the leaves. The Fitch algorithm applied to the character GGCCC
with the constraint that the character state at the root is A produces a unique most parsimonious solution of two
substitutions as depicted by Figure 3(c).

3 Results 

3.1 The impact of parsimony on the estimation of substitutions. 

In this section, we provide some mathematical insights into the role of Maximum Parsimony in the
estimation of the number of substitutions needed to convert a character into another one as explained in Section
2.3. In particular, we deliver a proof for Algorithm 1.



Theorem 3.1. Let $T$ be a rooted binary phylogenetic tree on taxon set $X$ and let $f$ be a character that evolved on $T$
due to some evolutionary model and let $f^d$ be another character on $X$. Then, the number of substitutions to be put on
$T$ which change the evolution of $f$ in such a way that $f^d$ evolves instead of $f$ can be calculated with Algorithm 1.

Proof. Let $f$, $f^d$, $X$, $T$ and $\Sigma$ be as required for the input of Algorithm 1. Then, the number of
substitutions needed to evolve $f^d$ on $T$ rather than $f$ depends solely on operation $\sigma$. In order to see this,
note that $\sigma$ describes an explicit way to translate $f$ into $f^d$ step by step, i.e. for each taxon separately.
The basic idea now is that in order to minimize the number of required substitutions, we need to consider the
underlying tree $T$, as this may allow a single substitution to act on an ancestor of taxa that undergo the same
substitution type rather than on each taxon separately. This idea has been described above in Section 2.2 and it
coincides precisely with the idea of the parsimony principle.

However, in order to avoid confusion regarding the operation $\sigma$ as a character on which to apply
parsimony, Algorithm 1 instead acts on the constant character. Clearly, in order to evolve the constant character
$h := c_1 \cdots c_1$ on a tree with root state $c_1$, the corresponding operation would be
$\hat{\sigma} := \sigma_0 \circ \cdots \circ \sigma_0$. If instead of $\hat{\sigma}$ we let $\sigma$ act on $h$, the
resulting character $c$ will differ from $h$ in the same way $f^d$ differs from $f$. Note
that two character states in $c$ are identical if and only if the corresponding substitutions in $\sigma$ are identical,
too. Therefore, it is possible to let MP act on $c$ rather than directly on $\sigma$.

By the definition of Maximum Parsimony, when applied to $c$ on tree $T$ with given root state $c_1$, it
calculates the minimum number $m$ of substitutions to explain $c$ on $T$. This number $m$ is therefore precisely
the number of substitutions needed to generate $c$ on $T$ rather than $h$. But as explained before, $c$ and $h$ are
by definition related the same way as $f$ and $f^d$. Therefore, $m$ also is the number of substitutions needed to
generate $f^d$ on $T$ rather than $f$. This completes the proof. □

3.2 The impact of different groups 

For any alphabet, there might be more than one Abelian group. Different groups might result in different
numbers of substitutions required to translate a character into another character. We illustrate this in the
following examples. For the four-state nucleotide alphabet there are two Abelian groups, namely the
Klein-four group and the cyclic group (see above). The cyclic group $\Sigma_C$ consists of the identity matrix $s'_0$
and the three substitution types $s'_1, s'_2, s'_3$ depicted by Figure 4(a). Hence,
$\Sigma_C = \{s'_0, s'_1, s'_2, s'_3\}$. Under $\Sigma_C$, we note that a substitution type which changes a character
state $c_i$ to $c_j$ does not necessarily change $c_j$ to $c_i$.

Example 3.2. Assume the rooted five-taxon tree in Figure 4(b) and the character GTAGA at the leaves, which is to
be converted into character ACCTC. The tree and the two characters are the same as in Example 2.3. By comparing
the two characters positionwise under the group $\Sigma_C$, we need a substitution $s'_3$ (depicted in blue in
Figure 4(a)) on the external branch leading to Taxon 1 to convert G into A at the first position. Analogously, we need
a substitution $s'_1$ on the external branches leading to Taxon 2 and to Taxon 4, and a substitution $s'_3$ on the
external branches leading to Taxon 3 and to Taxon 5. Thus, the operation $s' := (s'_3, s'_1, s'_3, s'_1, s'_3)$
transfers the character GTAGA into the character ACCTC. As the operation $s'$ also translates the constant character
AAAAA into CGCGC, converting GTAGA into ACCTC is equivalent to evolving the character state A at the root along the
tree to obtain the character CGCGC at the leaves. The Fitch algorithm applied to the character CGCGC with the
constraint that the character state at the root is A produces a unique most parsimonious solution of three
substitutions as depicted by Figure 4(c). Thus, under the group $\Sigma_C$ we need one substitution more than under
the $\Sigma_{K3ST}$ group.

Note that variation of the minimum number of substitutions needed to translate a character into
another one between different groups is not surprising: As different substitution types are needed to translate
one pattern into the other one, depending solely on the underlying group, one group might need the same
substitution type for some neighboring branches in the tree and another group different ones. Informally
speaking, this would imply that in the first case, the substitution could be "pulled up" by the Fitch
algorithm to happen on an ancestral branch, whereas in the second case this would not be possible.
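The dependence on the group is already visible before any tree is considered. The sketch below lists, for both groups, the substitution type required at each position when converting GTAGA into ACCTC, together with the derived character obtained from the constant character AAAAA; the helper names are hypothetical.

```python
K3ST = [
    {"A": "A", "G": "G", "C": "C", "T": "T"},   # s0
    {"A": "G", "G": "A", "C": "T", "T": "C"},   # s1
    {"A": "C", "C": "A", "G": "T", "T": "G"},   # s2
    {"A": "T", "T": "A", "C": "G", "G": "C"},   # s3
]
CYCLIC = [
    {"A": "A", "G": "G", "C": "C", "T": "T"},   # s'0
    {"A": "G", "G": "T", "T": "C", "C": "A"},   # s'1: A -> G -> T -> C -> A
    {"A": "T", "G": "C", "T": "A", "C": "G"},   # s'2 = s'1 applied twice
    {"A": "C", "G": "A", "T": "G", "C": "T"},   # s'3 = s'1 applied three times
]

def required_types(f, fd, group):
    """Index of the group element needed at each position to turn f into fd."""
    return [next(i for i, s in enumerate(group) if s[a] == b) for a, b in zip(f, fd)]

def derived_character(types, group, root_state="A"):
    """Apply the per-position types to the constant character."""
    return "".join(group[i][root_state] for i in types)

k3st_types = required_types("GTAGA", "ACCTC", K3ST)
cyclic_types = required_types("GTAGA", "ACCTC", CYCLIC)
```

Under K3ST the types are constant on Taxa 1 and 2 and on Taxa 3 to 5 (derived character GGCCC), whereas under the cyclic group they alternate (CGCGC), so fewer substitutions can be shared by neighboring branches.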



3.3 The link between substitution models and permutation matrices 



In Examples 2.1 and 2.2 we have shown that the K3ST substitution model can be included into our
framework. This section aims at discussing alternative models and how to identify their use (or lack thereof) for
our approach. The set $\Sigma_T$ contains a set of permutations which act on the characters in $\mathcal{C}^n$.

Most substitution models assume the independence of the different branches of a tree to compute the
joint probability of the characters in $\mathcal{C}^n$. Therefore, they use the probabilities for substitutions among
the character states in $\mathcal{C}$ along the edges of the tree $T$. We now establish a probabilistic link between
$\Sigma_T$ and $\mathcal{C}^n$. This link is provided by Birkhoff's theorem:



Theorem 3.3 (Birkhoff's theorem, e.g., [18], Theorem 8.7.1). A matrix $M$ is doubly stochastic, i.e., each column
and each row of $M$ sum to 1, if and only if for some $N < \infty$ there are permutation matrices
$\sigma_1, \ldots, \sigma_N$ and positive scalars $\alpha_1, \ldots, \alpha_N \in \mathbb{R}$ such that
$\alpha_1 + \cdots + \alpha_N = 1$ and $M = \alpha_1 \sigma_1 + \cdots + \alpha_N \sigma_N$.

Therefore, the weighted sum of the permutation matrices in $\Sigma_T$ yields a doubly stochastic matrix $M_T$ as
introduced in Section 2.2. $M_T$ also describes a random walk on $\mathcal{C}^n$ governed by $T$ where the single step
in $\mathcal{C}^n$ is illustrated by the associated Cayley graph. Its stationary distribution is uniform, i.e. when we
throw sufficiently many mutations on $T$ then we expect to see each pattern with probability $1/r^n$.

Another, even more useful consequence of Birkhoff's theorem is the fact that it tells us which substitu- 
tion models are suited for the OSM approach. If the transition matrix associated with the model is doubly 
stochastic, then we find a set of permutations which give rise to the model. 

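Let us see how this influences the symmetric form of the general time reversible model (GTR). It has the
transition matrix

$$P_{GTR} = \begin{pmatrix}
1-a-b-c & a & b & c \\
a & 1-a-d-e & d & e \\
b & d & 1-b-d-f & f \\
c & e & f & 1-c-e-f
\end{pmatrix}.$$

Assigning permutation matrices to the respective parameters yields the set $\Sigma_{GTR}$ with elements $s_0$ (identity)
and $s_a, s_b, s_c, s_d, s_e, s_f$, where each non-identity element exchanges the two character states linked by its
parameter and leaves the other two unchanged. The weighted sum of the non-identity elements yields

$$a s_a + b s_b + c s_c + d s_d + e s_e + f s_f = \begin{pmatrix}
d+e+f & a & b & c \\
a & b+c+f & d & e \\
b & d & a+c+e & f \\
c & e & f & a+b+d
\end{pmatrix},$$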

which is equal to $P_{GTR}$ because $a + b + c + d + e + f = 1$. Thus, the set $\Sigma_{GTR}$ is to GTR what
$\Sigma_{K3ST}$ is to K3ST. However, $\Sigma_{GTR}$ does not satisfy Condition 2, because it contains more than four
elements. Therefore, it creates ambiguity since for each nucleotide there are three permutations which do not change
the nucleotide. It is also not commutative (Condition 4), which means the order in which we assign the
permutations matters. And it is not closed under matrix multiplication (Condition 3), which means that
a concatenation of permutations in $\Sigma_{GTR}$ might lead to a new permutation not in $\Sigma_{GTR}$, i.e. we would
encounter a new mutation type. All of this shows why the overall applicability of GTR to the OSM approach
is rather limited. More complex models like Tamura-Nei [19] do not even permit the decomposition of their
transition matrices into the convex sum of permutation matrices. However, including the concept of partial
permutation matrices flU can address this problem. While this approach is interesting for future work, it
is beyond the scope of this paper.
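These failures can be confirmed by enumeration. The sketch below models the six non-identity elements as transpositions of state indices 0 to 3 and uses made-up weights that sum to 1; the variable names and the weights are assumptions of this example.

```python
from fractions import Fraction
from itertools import combinations, product

def transposition(i, j, r=4):
    """Permutation (tuple of image indices) exchanging states i and j."""
    p = list(range(r))
    p[i], p[j] = j, i
    return tuple(p)

def compose(p, q):
    """Apply q first, then p."""
    return tuple(p[q[k]] for k in range(len(q)))

names = "abcdef"
# a: (0,1), b: (0,2), c: (0,3), d: (1,2), e: (1,3), f: (2,3)
sigma_gtr = dict(zip(names, (transposition(i, j) for i, j in combinations(range(4), 2))))
members = set(sigma_gtr.values()) | {tuple(range(4))}

closed = all(compose(p, q) in members for p, q in product(members, repeat=2))
commutative = all(compose(p, q) == compose(q, p) for p, q in product(members, repeat=2))
fixing_first = [k for k, s in sigma_gtr.items() if s[0] == 0]   # types leaving state 0 fixed

weights = dict(zip(names, [Fraction(1, 4), Fraction(1, 8), Fraction(1, 8),
                           Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]))  # hypothetical
P = [[sum(weights[k] for k, s in sigma_gtr.items() if s[i] == j) for j in range(4)]
     for i in range(4)]
```

The set has seven elements, is neither closed nor commutative, and three of its non-identity elements fix any given state; nevertheless the weighted sum `P` is doubly stochastic with diagonal entries such as $d + e + f$.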



3.4 Application to other biologically interesting sets 



As stated in Section 2.2, the OSM model only requires an underlying Abelian group. Thus, the OSM setting
is applicable not only to binary data or four-state (DNA or RNA) data, but also to doublet, codon, and
amino acid characters.

In particular, there are four Abelian groups for the twenty-state amino acid alphabet, namely
$\mathbb{Z}_2 \times \mathbb{Z}_2 \times \mathbb{Z}_5$, $\mathbb{Z}_4 \times \mathbb{Z}_5$,
$\mathbb{Z}_2 \times \mathbb{Z}_{10}$, and the cyclic group $\mathbb{Z}_{20}$ (see e.g., [9] for a complete list of all
groups with up to 35 elements). Their construction is analogous to the construction of the Klein-Four group in
Example 2.1. For example, the elements of $\mathbb{Z}_4 \times \mathbb{Z}_5$ are Kronecker products of one of the four
permutations in the cyclic group $\mathbb{Z}_4$ with one of the five permutations of the cyclic group $\mathbb{Z}_5$.
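A minimal sketch of this construction follows, with permutations stored as index tuples instead of matrices. The pairing of state indices, the function names and the use of cyclic shifts as the concrete representation are assumptions of the example; the assignment of amino acids to the 20 indices is left open.

```python
def cyclic_shifts(m):
    """The m permutations k -> (k + shift) mod m of the cyclic group on m states."""
    return [tuple((k + shift) % m for k in range(m)) for shift in range(m)]

def direct_product(group_a, group_b):
    """Permutation analogue of the Kronecker product: state (a, b) has index a*len(b)+b."""
    nb = len(group_b[0])
    return [tuple(p[a] * nb + q[b] for a in range(len(p)) for b in range(nb))
            for p in group_a for q in group_b]

def compose(p, q):
    """Apply q first, then p."""
    return tuple(p[q[k]] for k in range(len(q)))

z4xz5 = direct_product(cyclic_shifts(4), cyclic_shifts(5))
identity = tuple(range(20))
self_inverse = [p for p in z4xz5 if compose(p, p) == identity]
```

The result consists of 20 commuting permutations in which every state is mapped to every other state by exactly one element (Condition 2). Only two of them, the identity included, are self-inverse.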






Figure 5 depicts the 20 substitution types, i.e. the 20 operations including the identity acting on the
amino acid character states for all four Abelian groups. If we assign probabilities to the substitution types
in the matrices, the resulting matrices are doubly stochastic. The matrices show several features of the
groups, e.g. that contrary to the Klein-Four group the elements of the group are not self-inverse but instead
the effect of a permutation is reversed by a different mutation. Such events are present in some models of
nucleotide evolution, like the strand symmetric model [20], and relatively common in amino acid models
where the transition matrix is generated by, e.g., counting mutation types in amino acid alignments (see,
e.g., [21] for an overview). It might be interesting to see whether any of these can be fitted. The illustrations
in Figure 5 also suggest some ordering of the amino acids to fit the model. For instance,
$\mathbb{Z}_2 \times \mathbb{Z}_2 \times \mathbb{Z}_5$ and $\mathbb{Z}_4 \times \mathbb{Z}_5$ seem to partition the sets
into four groups with five elements each.

Conclusions 

In this paper, we provide the necessary mathematical background for the OSM setting which was
introduced and used previously [1, 17], but had not been analyzed mathematically for more than two character
states. Moreover, the present paper also delivers new insight concerning the requirements for the OSM
model to work: In fact, we were able to show that mathematically, it is sufficient to have an underlying
Abelian group - which shows a generalization of the OSM concept that was believed to be impossible
previously [2]. Therefore, we show that OSM is applicable to any number of states.

However, note that the original intuition of the authors in [2] was biologically motivated: The authors
supposed that the group not only has to be Abelian, but also symmetric in the sense that each operation
can be undone by being applied a second time. Thinking about the DNA, for instance, this works: For
example, the transition from A to G can be reverted by another substitution of the same type, namely a
transition from G to A. This symmetry criterion is fulfilled by the Klein-Four group, but not by the cyclic
group on four states. Unfortunately, for 20 states there is no Abelian group fulfilling this criterion, which is
why the demonstrated generalization to 20 states does not provide a nice symmetry (cf. Figure 5).
Therefore, it remains unclear at this stage if there are biologically motivated settings for which our twenty-state
generalization is directly applicable.
Figures

Figure 1- Construction of the OSM matrix 

Figure 2- The Cayley graph for the two-taxon tree from Figure 1(a). 

Figure 3- Computing the minimal number of substitutions to translate a character into another one. 

Figure 4- Converting one character into another character using the cyclic group. 

Figure 5- Matrices illustrate the four Abelian groups for the twenty-state amino acid alphabet. 
Figure 1: (a) A rooted tree with Taxa 1 and 2. (b) A transition $s_1$ on the left branch (the red branch)
changes a character into exactly one new character as depicted by the red horizontal stripe cells of the
permutation matrix $s^{e_1,1}$. The matrix has 16 rows and 16 columns representing the possible characters for
the alignment of two nucleotide sequences. The permutation matrices generated by $s_1$ for the right branch
$e_2$ (blue) and for the branch leading to the "root" $e_{12}$ (green) are displayed in (c) and (d), respectively. The
convex sum of all the weighted (by the relative branch length and the probability of the substitution type)
permutation matrices generated by all substitution types for all branches is the OSM matrix of the tree ($M_T$)
as shown in (e). Horizontal stripe cells represent the probability of the Transition $s_1$; diagonal stripes the
Transversion $s_2$; and thin reverse diagonal stripes the Transversion $s_3$. The colors of these cells indicate the
relative branch lengths and follow the colors of the branches as in (a). Thus, these colors also depict the
branch origin of the substitutions.
Figure 2: The vertices depict the characters in $\mathcal{C}^2_{DNA}$. Two vertices are connected by an edge if there is
a permutation in $\Sigma_T$ transforming one of the associated characters into the other. (a) depicts the Cayley
graph for the Klein Four group $\mathbb{Z}_2 \times \mathbb{Z}_2$, and (b) depicts the Cayley graph for the cyclic
group $\mathbb{Z}_4$.
Figure 3: (a) depicts the Klein-four group $\Sigma_{K3ST}$, which consists of the identity $s_0$ and the three
substitution types $s_1, s_2, s_3$ from the K3ST model. (b) In order to convert the character GTAGA into ACCTC under
$\Sigma_{K3ST}$, we need to introduce the operation $s := (s_1, s_1, s_2, s_2, s_2)$. As the operation $s$ also
translates the constant character AAAAA to GGCCC, converting GTAGA into ACCTC is equivalent to evolving the
character state A at the root along the tree to obtain the character GGCCC at the leaves. The Fitch algorithm
applied to the latter produces a unique most parsimonious solution of two substitutions as depicted by (c).
Figure 4: (a) depicts the cyclic group $\Sigma_C$, which consists of the identity $s'_0 = s_0$ and the three substitution
types $s'_1, s'_2, s'_3$ for nucleotide character states. (b) In order to convert the character GTAGA into ACCTC
using this group, we need to introduce the operation $s' := (s'_3, s'_1, s'_3, s'_1, s'_3)$. As the operation $s'$ also
transforms the constant character AAAAA to CGCGC, converting GTAGA into ACCTC is equivalent to evolving the
character state A at the root along the tree such that the character CGCGC is attained at the leaves. The
Fitch algorithm applied to the latter produces a unique most parsimonious solution of three substitutions
as depicted by (c).
Figure 5: (a) the $\mathbb{Z}_2 \times \mathbb{Z}_2 \times \mathbb{Z}_5$ group, (b) $\mathbb{Z}_4 \times \mathbb{Z}_5$,
(c) $\mathbb{Z}_2 \times \mathbb{Z}_{10}$, and (d) $\mathbb{Z}_{20}$. In each matrix, the 20
different colors ranging from light yellow to dark red can be regarded to represent 20 substitution types,
i.e. 20 operations including the identity acting on the amino acid character states or the corresponding
probabilities of these substitution types. In the latter case, the matrices are all doubly stochastic.